Extending and Implementing the Self-adaptive Virtual Processor for 
Distributed Memory Architectures 

Abstract 

Many-core architectures of the future are likely to 
have distributed memory organizations and need fine 
grained concurrency management to be used effectively. 
The Self-adaptive Virtual Processor (SVP) is an ab- 
stract concurrent programming model which can pro- 
vide this, but the model and its current implementa- 
tions assume a single address space shared memory. 
We investigate and extend SVP to handle distributed en- 
vironments, and discuss a prototype SVP implementa- 
tion which transparently supports execution on hetero- 
geneous distributed memory clusters over TCP/IP con- 
nections, while retaining the original SVP programming 
model. 



1. Introduction 

As processor architectures are moving into the 
many-core era, potentially scaling up to more than 
1000s of cores on a chip [6][1], it becomes infeasible 
to maintain a memory model which guarantees system- 
wide sequential consistency. Full cache-coherence will 
not scale for such architectures [25][35] or will suffer 
from large latencies, so future many-core architectures 
are likely to have a more distributed and weakly consis- 
tent memory design. For example, these could be orga- 
nized in a similar way as the experimental 48-core Intel 
SCC research chip [28], on which each processor can 
access both a private and shared memory, but no hard- 
ware cache coherence is provided. In order to exploit 
many-cores to their full potential, it is essential to be 
able to create parallelism at a fine granularity in order 
to expose the maximum amount of concurrency. We re- 
quire a programming model to express this concurrency, 
but which can also handle such distributed memory or- 
ganizations efficiently. 

In this report, we apply and adapt the definition of 
the Self-adaptive Virtual Processor (SVP) to distributed 
memory organizations, naming this extension DSVP¹. 
SVP is an abstract concurrent programming and machine 
model [29], which evolved from the earlier work 
on the Microthread CMP architecture [17]. It can be used 
to express concurrency at many levels of granularity for 
multi- or manycore systems, and uses weakly consistent 
shared memory semantics. As SVP is a generic model 
to program parallel systems, this method can be applied 
to the whole spectrum of memory organizations; from 
cc-NUMA machines where you want to maintain lo- 
cality, to non cache coherent shared memory machines 
such as the Intel SCC or other future many-core archi- 
tectures, and even a cluster of nodes on a network, i.e. 
a heterogeneous distributed system. This is achieved 
by extending SVP implementations and the way they 
are programmed to support distributed memory spaces, 
and by translating SVP actions into messages in a dis- 
tributed environment. Using this approach, we believe 
that we have made a step forward in efficiently targeting 
any architecture within the aforementioned spectrum. 

To go into further detail, we first give a short 
introduction to the semantics and actions of the SVP 
model and describe its current memory consistency 
model (Section 2). In Section 3 we define how we can 
apply SVP to a distributed environment, at which level 
of granularity we can identify and distribute software 
components, and how we identify their dependencies in 
order to communicate data between nodes. 
We then discuss our research prototype 
that implements these techniques using messages over 
TCP/IP in Section 4, and show that this follows the 
original SVP memory consistency model. This implementation 
is then evaluated and discussed (Section 5), where 
we show that this approach integrates nicely with SVP 
as SVP actions are handled transparently and uniformly 
between local and remote executions. This discussion 
is continued in Section 6, where we compare it with 
a broad spectrum of related approaches in distributed 
computing. We then conclude in Section 7. 

¹ We use the name DSVP throughout the report for matters specific 
to our extension, and SVP for anything that applies to both the original 
and extended model. 

2. The SVP Model 

SVP is a generic concurrent programming and machine 
model [29], of which both coarse [41] and fine 
[30] grained implementations are available. The goal of 
SVP is to be able to express concurrency without having 
to explicitly manage it. The μTC language [32], 
based on C99, has been defined to capture the semantics 
of SVP. This language is used to drive several SVP 
implementations, as it extends traditional C with syntax 
to express all SVP actions. 

The SVP model defines a set of actions to express 
concurrency on groups (families) of indexed identical 
threads. Each thread can execute a create action to start 
a new concurrent child family of threads, and later on 
use the sync action to wait for its termination, imple- 
menting a fork-join style of parallelism. The create ac- 
tion has a set of parameters to control the number and 
sequence of created threads, as well as a reference to 
the thread function that the threads will execute. This 
thread function can have arguments, defined by SVP's 
communication channels explained later on. 

As any thread can create a new family, the concur- 
rency in a program can consist of many hierarchical lev- 
els, often referred to as the concurrency tree of a pro- 
gram. Besides these two basic constructs, there is the 
kill action to asynchronously terminate an execution. 

Resources SVP code has no notion of what resources 
are, as it is resource and scheduling naive. However, 
the concept of place is provided as an abstract resource 
identifier. On a create action a place can be specified 
where the new family should be created, binding the ex- 
ecution onto a certain resource. What this place physi- 
cally maps to, is left up to the SVP implementation; for 
example, on a many-core architecture like the Micro- 
grid, it could be a group of processors. On other imple- 
mentations it could, for example, be a reserved piece of 
FPGA fabric, an ASIC, or some time-sliced execution 
slot on a single- or multi- processor system. As long 
as the underlying implementation supports it, multiple 
places can be virtualized onto a single resource. 

There is one important property that a place can 
have; it can be exclusive. This means that each create 
on such an exclusive place will be sequentialized. Only 
one family can be executing on such a place at a time, 
providing us with a mutual exclusion mechanism. 



Communication and Synchronization Synchro- 
nized communication is provided through a set of 
channels, which run between threads in a family and 
their parent thread. There are two types of unidirec- 
tional write-once channels; global and shared of which 
multiple can be present. These channels have non- 
blocking writes and blocking reads. A global channel 
allows vertical communication in the concurrency tree 
from the parent thread to all threads in the family. A 
shared channel allows horizontal communication, as 
it daisy-chains through the sequence of threads in the 
family, connecting the parent to the first thread and 
the last thread back to the parent. These channels are 
defined as arguments of a thread function, similar to 
normal function arguments, and identify the data de- 
pendencies between the threads. Due to this restricted 
definition, and under restricted use of exclusive places, 
we can guarantee that the model is composable and 
free of communication deadlock [43]. Furthermore, 
this implies that every family of threads has a very well 
defined sequential schedule if concurrent execution 
is infeasible, as it is guaranteed that a family can run 
to completion when all of its threads are executed in 
sequence. This enables program transformations that 
sequentialize families into loops at the leaves of the 
concurrency tree, allowing us to adapt the granularity 
and amount of exposed concurrency in an SVP program 
for a specific platform. 

Memory Consistency The model assumes a global, 
single address space, shared memory. However, this is 
seen as asynchronous and has a restricted consistency 
model. Therefore it is not suitable for synchronizations, 
and no explicit memory barriers or atomic operations 
are provided. The consistency model is described by 
the following three rules: 

• Upon creation, a child family is guaranteed to see 
the same memory state as the parent thread saw at 
the point where it executed create. 

• The parent thread is guaranteed to see the changes 
to memory by a child family only when sync on 
that family has completed. 

• Subsequent families created on an exclusive place 
are guaranteed to see the changes to memory made 
earlier by other families on that place. 

The memory consistency relationship between parent 
and child threads somewhat resembles the well-known 
release consistency model [19]. In that sense, the point 
of create resembles an acquire, and the point of sync re- 
sembles the release. We should note that the third rule 



is a very important property as it can be used to im- 
plement communication between two arbitrary threads, 
but it can also be used to implement a service; state is 
resident at the exclusive place and instances of the func- 
tions implementing that service are created on the place 
by its clients. An example of such a service has been 
presented in [26] for the SVP-based Microgrid architecture. 

Data passed through the global or shared channels 
is always considered consistent. However, it is likely 
that in certain implementations the channels are lim- 
ited to only scalar values, therefore a reference to a 
datastructure in memory would be passed instead of the 
structure itself. An implementation then has to guaran- 
tee that there is memory consistency for the referenced 
structure when it is read from the channel. 

Example The basic concepts of SVP are illustrated in 
Figure 1 and Figure 2 using some example code that 
generates a Fibonacci sequence and stores it in an ar- 
ray. It must be noted that this example yields little ex- 
ploitable concurrency, but is merely used as a simple 
illustration of the concepts. 

 1 thread fibonacci(shared int p1, 
 2     shared int p2, int* result) 
 3 { 
 4     index i; 
 5     result[i] = p1 + p2; 
 6     p2 = p1; 
 7     p1 = result[i]; 
 8 } 
 9 
10 main() 
11 { 
12     family fid; 
13     place pid = PLACE_DEFAULT; 
14     int result[N]; 
15     int a = result[1] = 1; 
16     int b = result[0] = 0; 
17 
18     create(fid;pid;2;N;;) 
19         fibonacci(a, b, result); 
20     sync(fid); 
21 } 

Figure 1. Fibonacci code example 

In Figure 1 we show the C-like μTC code that implements 
Fibonacci, with the iterations of the algorithm 
defined as a thread function in lines 1-8. The definition 
on lines 1 and 2 identifies the shared channels for 
the two dependencies in Fibonacci, as well as a global 
that will pass the pointer for the result array. The shared 
channels are read implicitly on line 5, and written on 
lines 6 and 7. Lines 10 to 21 show the main function 
of the program that will start the concurrent Fibonacci 
iterations. Line 12 defines a variable that can hold a 
family identifier, which is set by the create on line 18. 
Line 13 defines a place identifier, which is set to a 
default defined by the SVP implementation. Then the 
initial values for the algorithm are set in lines 15 and 16, 
and the concurrent iterations are spawned with the 
create statement in lines 18 and 19, creating a family of 
indexed threads from 2 to N on the place identified by 
pid. The two omitted parameters can be used to further 
control the creation and indexing of threads by step and 
block size. Information to identify the created family 
is stored in fid, and the sync statement on line 20 uses 
this to have the main thread wait until all threads in the 
Fibonacci family have terminated. On line 19, the 
variables a and b are used to initialize the shared channels 
for the fibonacci family, providing the values that the 
first thread will read, as well as the pointer to the array 
to store the results. 

Figure 2. Fibonacci time-concurrency diagram 

In Figure 2 the time-concurrency diagram is shown 
that corresponds with our example, which shows the in- 
teractions between threads. T a is the main thread that 
executes the create, which then waits immediately us- 
ing sync on the termination of the created family of 
threads. The fibonacci threads tih---tn are then started, 
and all but the first will immediately block and suspend 
on reading the shared channels. The first thread, which 
receives the shared values from the parent, can execute 
and then passes on the values to the next thread. As 
Fibonacci requires the values of iterations n-1 and n-2, 
the value from the shared channel p1 is forwarded 
to p2 in each thread. A suspended thread continues its 
execution only once its shared channels have been written. 
When all threads have completed, the sync in 
the parent thread completes and it resumes its execution 
and can now safely use the results array. The writes to 
p1 and p2 by the last thread could be read by the parent 
again after the sync, but are not used in this example. 
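
The sequential schedule that every family is guaranteed to have can be illustrated on the Fibonacci family of Figure 1. The following plain C sketch is our own illustration, not the output of any SVP tool chain: the name fibonacci_sequential is hypothetical, the shared channels are modeled as loop-carried local variables, and we assume the index range 2 to n-1.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sequentialized form of the fibonacci family from Figure 1.
 * Each loop iteration plays the role of one thread with index i; the
 * shared channels p1 and p2 become loop-carried variables. */
void fibonacci_sequential(int p1, int p2, int *result, size_t n)
{
    assert(result != NULL);
    for (size_t i = 2; i < n; i++) {
        result[i] = p1 + p2;  /* read both shared channels */
        p2 = p1;              /* write shared p2 for the next thread */
        p1 = result[i];       /* write shared p1 for the next thread */
    }
}
```

As in Figure 1, the caller is expected to store the first two values in result[0] and result[1] and to pass them as the initial values of p2 and p1.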

3. Distributed SVP 

As we have claimed in the introduction that the work 
described here can be applied to a whole range of possible 
target architectures, we require a definition of 
the distributed environment that we want to apply 
SVP to, and of how we represent it in the model. We 
then discuss how we identify the software components 
that we want to distribute, and how we identify 
which data to communicate. 

Distributed Environment in SVP We define our dis- 
tributed environment to consist of a set of processing re- 
sources which implement SVP, be it either in software 
or in hardware, and that are grouped into nodes of one 
or more of these resources. We define a node to have 
a single addressable, coherent, and optionally uniform, 
access to some memory. The nodes are interconnected 
by an infrastructure consisting of one or more, possibly 
heterogeneous, networks, on which each node can, di- 
rectly or indirectly, send a message to any other node. 
A place is identified as a subset of one or more (or all) 
resources within a single node, which therefore inherits 
the properties that we have just described. 

To give some more concrete examples: in a NUMA 
system which is not fully cache coherent, a node would 
be a group of processors that are in a single NUMA 
domain that is internally cache coherent. A place would 
then be one or more of these processors. In the case of 
a networked (e.g. Ethernet) cluster of multi-core 
machines, each machine would be a node and each core in 
a machine could be identified by its own place. However, 
if these multi-core machines were cache coherent 
NUMA architectures themselves, one could optionally 
choose to subdivide these into separate nodes 
per NUMA domain to be able to express and exploit 
memory locality. As a final example, the Intel SCC [28] 
does not provide any cache coherence, so a node and a 
place would be only a single core on the chip. 

It should be noted that within a single node, the 
classic definition of SVP works perfectly, and we only 
need to take into account interactions that are remote, 
i.e. that are between nodes, in order to apply it to a 
distributed environment. All SVP actions can be triv- 
ially translated into messages that can be sent across 
a network, and the place concept is nicely suited to 
capture the necessary addressing information on which 
node this place is physically located. By using a place 
on a remote node, a create transparently turns from a 
local concurrency control into a concurrent remote pro- 
cedure call. Threads in a family created this way can 
then again create more families there locally, or at some 
point decide to distribute their child families to other 
nodes again. However, the challenge lies in defining 
a way to handle a distributed memory organization in- 
stead of a loosely shared memory system. We need to 
define how, and at which level of granularity, we can 
identify parts of our program that we can distribute to 
other nodes. 
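
How a place can carry the addressing information that decides between a local and a remote create may be sketched as follows. This is a minimal C illustration under our own assumptions: the layout of struct place, its node and resources fields, and the function route_create are hypothetical names and do not reflect the data structures of any existing SVP implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical place identifier: a node plus a set of resources on it. */
struct place {
    uint32_t node;       /* node the place is physically located on */
    uint32_t resources;  /* bit mask of processing resources in that node */
};

enum create_route {
    CREATE_LOCAL,   /* handled by the classic shared memory SVP create */
    CREATE_REMOTE   /* translated into a message sent to target.node */
};

/* Decide how a create on `target` is handled when issued on `local_node`. */
enum create_route route_create(struct place target, uint32_t local_node)
{
    assert(target.resources != 0);  /* a place holds at least one resource */
    return target.node == local_node ? CREATE_LOCAL : CREATE_REMOTE;
}
```

The calling thread does not need to know the outcome; the same create action is issued in both cases, which is what makes the distribution transparent.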

Software Components Using the restrictions that 
SVP imposes, we can make some assumptions about 
the structure of SVP programs. Because a program is 
structured as a hierarchical tree of concurrency, most 
computation, and therefore data production and/or con- 
sumption, takes place at the more fine grained concur- 
rency in the outer branches and leaf nodes in the tree. 
An application can be seen as a collection of software 
components at the outer branches, connected together 
with control code higher up the hierarchy. Due to the 
restrictions in communication and synchronization that 
SVP imposes, we can assume that these software com- 
ponents are relatively independent, and therefore are 
very suitable for distribution across different nodes. 

Having this view in mind, and by taking the mem- 
ory consistency model defined previously, we can make 
some further assumptions about the communication of 
data within an SVP program. As communication is re- 
stricted at the family level, where a thread can commu- 
nicate data through the shared and global channels to a 
family it creates, we can make the following observation: 
the created threads will, disregarding global references, 
only access data that is either passed through 
these shared or global channels, or data in memory that 
is accessed through a reference that is passed this way. 
Newly generated data that needs to be communicated 
back to the parent, has to be passed back again through 
the shared channel. Therefore, the data dependencies 
of software components are identified by the shared and 
global channels to the top level family of such compo- 
nent. Threads accessing objects in memory through a 
global reference are the exception to this, but they have 
to be created on a specific exclusive place in order to 
guarantee consistency. 

Our strategy for building DSVP programs is based 
on the previous observations; we can identify software 
components at the level of a family and its children, 
that can be distributed to remote nodes with a create ac- 
tion using the corresponding place. This component can 
then internally create more threads locally on places on 
that node, or can decide at some point to create further 
sub components on other nodes. However, the whole 
component has a single interface at its top level family, 
and its dependencies are identified by the shared and 
global channels to that family. 

Distributed Memory As distribution is only done at 
the level of families, we can use the information in the 
channels to the created family to determine which data 
needs to be transferred. At the point of create, we syn- 
chronize or copy all objects that the family receives ref- 
erences to, to the node it is created on. As all threads of 
the created family run on the same place and therefore 
within the same consistent memory, such replication is 
not required for internal communication of objects between 
sibling threads. When the family completes, at 
the point of sync, these objects are synchronized or copied 
back again, taking into account newly created references the 
family might send back through its shared channels. 
The second case where a family updates global state on 
an exclusive place is not an issue; as each family ac- 
cessing this data is created on the same exclusive place, 
it shares the same consistent memory, and no data com- 
munication is required besides the earlier defined inputs 
and outputs. 

This approach slightly restricts the original con- 
sistency model, as it delivers consistency only for the 
memory areas that the child family can effectively see. 
However, this approach is often too naive; for example, 
it does not keep track of how data is used. Depending on 
data being consumed or modified by the created family, 
we would like to avoid copying back unmodified data 
for efficiency, so an implementation has to detect or 
receive hints on which data has been modified. Furthermore, 
for more complex large objects, e.g. a database, 
does a shallow copy suffice, or do we naively make 
an expensive deep copy of the object? And what about 
objects with a non-static size? 

Some of these issues can be solved in a DSVP im- 
plementation or on top of that, by using the notion of 
place as we presented it for a distributed environment; 
instead of plain memory references, objects could be 
referenced by a combination of memory location and 
place, as a place also identifies a memory range at- 
tached to a specific node. This way, a shallow copy of 
complex objects is sufficient given that it internally uses 
this kind of fat references, so that other referred objects 
can be fetched from the appropriate place on demand. 
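
A fat reference of this kind could look as follows in C. This is only a sketch under our own assumptions: the types place_id, struct fat_ref and struct list_node, and the two helper functions, are hypothetical and are not part of DSVP or of our prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical place identifier; here simply a number. */
typedef uint32_t place_id;

/* A fat reference: a memory location qualified by the place
 * (and therefore the node) whose memory it lives in. */
struct fat_ref {
    place_id place;  /* where the object resides */
    uint64_t addr;   /* address, meaningful only within that place */
    uint32_t size;   /* bytes to fetch when the object is not local */
};

/* A list element that links to its successor by fat reference, so that
 * a shallow copy of one element never drags the whole list along. */
struct list_node {
    int value;
    struct fat_ref next;
};

/* Build a fat reference to an object in the memory of place `here`. */
struct fat_ref fat_ref_make(place_id here, const void *obj, size_t size)
{
    struct fat_ref ref;
    assert(obj != NULL);
    ref.place = here;
    ref.addr = (uint64_t)(uintptr_t)obj;
    ref.size = (uint32_t)size;
    return ref;
}

/* Nonzero when the object can be accessed directly from place `here`;
 * otherwise it would have to be fetched from ref.place on demand. */
int fat_ref_is_local(struct fat_ref ref, place_id here)
{
    return ref.place == here;
}
```

Fetching a non-local object would then amount to a create on ref.place of a small family that returns the referenced data through its channels.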



We decided not to make this mechanism part of our 
model for flexibility. DSVP already provides the nec- 
essary constructs so that this can be done on top of any 
implementation. Another observation is that an imple- 
mentation would benefit from having more fine grained 
control over the inputs and outputs of a family, which 
requires a programming language where we can either 
analyze or specify in detail which data goes in, and 
which data is generated or modified by a software com- 
ponent. In the next section we will discuss our proto- 
type implementation which uses a C based language, in 
which this analysis is hard, and consequently we leave 
it to the programmer to explicitly specify this. After all, 
the designer of a component has the best knowledge of 
what its inputs and outputs are. 

4. Prototype Implementation over TCP/IP 

We have built a prototype implementation of DSVP 
using the mechanisms described in the previous section 
by extending the pthreads based implementation of 
SVP [41] with messages over TCP/IP to signal the SVP 
actions between nodes. It supports heterogeneous clusters 
of multi-core systems, connected with for example 
an Ethernet network, where each system is a single 
node. This implementation is driven by programs written 
in the C based μTC language [32], in which threads 
are declared in a similar manner to C functions. Addi- 
tional keywords are used to distinguish the shared and 
global channels in the arguments, but the input and out- 
put data is not explicitly indicated. This gives us the 
same problem as when attempting to analyze C func- 
tions; pointer arguments may carry input data, output 
data, or both, and manipulation of file-scope or global 
variables (side effects) is not indicated at all. Yet, we 
must know exactly which data will need to be sent to 
the remote place and back. Therefore, we require that 
the programmer, or anything that generates μTC code, 
explicitly tells us what the complete set of input and 
output data is in a data description function. Besides 
being a requirement for our implementation, this also 
provides valuable documentation about the behaviour 
of a thread function. 

Data Description Functions A data description func- 
tion is a special function for each thread function which 
describes the inputs and outputs using special state- 
ments, allowing the corresponding thread function to 
be distributed to other nodes by our DSVP implementation. 
This function receives the same arguments as 
the thread function, and is called by the implementation 
at the creating and completing stage when the 
corresponding thread function is executed on a remote 
node. The data description function contains INPUT(v), 
OUTPUT(v) and INOUT(v) statements, which trigger 
data transfers at the different stages. Data tagged with 
INPUT is copied to the remote node at the stage when 
the thread function is started by a create, and OUTPUT 
data is copied back to the creating node at the stage 
when the created family finishes and sync completes. 
INOUT is a shorthand notation for the combination of 
the previous two. 

    DISTRIBUTABLE_THREAD(fibonacci)(int p1, 
        int p2, int* result, int N) 
    { 
        INPUT(p1); 
        INPUT(p2); 
        for(int i = 2; i < N; i++) 
        { 
            OUTPUT(result[i]); 
        } 
    } 

Figure 3. Data description function for Fibonacci 
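
The two-stage behaviour of these statements can be made concrete with a small C model. This is a sketch under our own assumptions and not the code of our prototype: struct dd_ctx, the dd_ helper functions and fibonacci_describe are hypothetical names, and instead of transferring data the model only counts the bytes that each stage would send.

```c
#include <assert.h>
#include <stddef.h>

/* The two points at which a data description function is called. */
enum dd_stage { DD_CREATE, DD_SYNC };

/* Hypothetical context: counts the bytes that would be transferred. */
struct dd_ctx {
    enum dd_stage stage;
    size_t bytes_to_remote;   /* would travel with the create */
    size_t bytes_to_creator;  /* would travel back at the sync */
};

static void dd_input(struct dd_ctx *ctx, size_t size)
{
    if (ctx->stage == DD_CREATE)   /* inputs only matter at create */
        ctx->bytes_to_remote += size;
}

static void dd_output(struct dd_ctx *ctx, size_t size)
{
    if (ctx->stage == DD_SYNC)     /* outputs only matter at sync */
        ctx->bytes_to_creator += size;
}

/* The macros expect a variable named ctx to be in scope. */
#define INPUT(v)  dd_input(ctx, sizeof(v))
#define OUTPUT(v) dd_output(ctx, sizeof(v))
#define INOUT(v)  do { INPUT(v); OUTPUT(v); } while (0)

/* Description of the Fibonacci family: p1 and p2 go in,
 * result[2] up to result[n-1] come back. */
void fibonacci_describe(struct dd_ctx *ctx, int p1, int p2,
                        int *result, int n)
{
    assert(ctx != NULL && result != NULL);
    INPUT(p1);
    INPUT(p2);
    for (int i = 2; i < n; i++)
        OUTPUT(result[i]);
}
```

Calling fibonacci_describe once with the stage set to DD_CREATE and once with DD_SYNC mirrors the two calls made around a remote execution; the same function body selects different data at each stage.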

Within these data description functions, loops and 
conditional expressions can be used around the state- 
ments describing input and output. This provides the 
flexibility needed in order to express the dynamic na- 
ture of family input/output data, for example dynami- 
cally sized arrays or the traversal of more complex data 
structures. In Figure 3 we show how we can 
make the Fibonacci example code shown earlier in 
Figure 1 distributable by defining such a data 
description function. The startup values of the shared 
channels are only used as input to the Fibonacci func- 
tion, and the array with the generated sequence is sent 
back as output. Please note that we needed to add the 
size parameter to the thread function to support a non- 
fixed size for the result array. 

Using these data description functions we have a 
powerful way of expressing data dependencies and con- 
trolling which data goes into and comes out of a family 
of threads that is created on a remote node. Due to the 
restrictions on SVP programs, data is only communicated 
between two nodes at well known points, and no 
coherency protocols are required to keep data consis- 
tent. Because these data transfers are completely pro- 
grammable in our prototype implementation, full con- 
trol can be exercised over how data is distributed, for 
example for splitting up arrays or array subsets across 
multiple nodes. 



Types and Serialization For each thread function 
that needs to be distributable, the arguments should 
consist of distributable data types, i.e. data types that 
the implementation knows how to serialize and repre- 
sent on the network. Many standard C data-types are 
already provided as distributable, but more complex 
objects such as structs or structs linked with pointers 
must be defined using XDR [14], which allows a syn- 
tax similar to C. The XDR library provides us with 
(de)serialization, and guarantees data interoperability 
between different architectures so that we can support 
clusters of heterogeneous nodes. As long as a thread 
function is defined to be distributable, it can be created 
both remotely and locally. At run-time the implemen- 
tation checks if a create is to a local or a remote place, 
and only on a remote create will the (de)serialization be 
performed; the function can still be created locally with 
a negligible effect on performance compared to the non- 
distributed implementation. In the distributed fibonacci 
example we've just shown, we could have defined the 
results array as a new distributable data type with a 
known fixed length, and then directly handle it with a 
single OUTPUT() statement without the loop. Alterna- 
tively, it could have been made dynamic by passing the 
length as an argument and then using this in the loop 
bounds. 
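
To illustrate what an architecture independent encoding of the results array involves, the following C sketch encodes 32-bit integers as four bytes in big-endian order, which is also how XDR represents integers. It is a hand-written illustration and does not call the XDR library used by our prototype; encode_int_array, decode_int_array and the two byte helpers are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store one 32-bit integer as 4 big-endian bytes, whatever the host order. */
static void put_be32(unsigned char *out, int32_t value)
{
    uint32_t u = (uint32_t)value;
    out[0] = (unsigned char)(u >> 24);
    out[1] = (unsigned char)(u >> 16);
    out[2] = (unsigned char)(u >> 8);
    out[3] = (unsigned char)u;
}

/* Inverse of put_be32. */
static int32_t get_be32(const unsigned char *in)
{
    uint32_t u = ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
                 ((uint32_t)in[2] << 8) | (uint32_t)in[3];
    return (int32_t)u;
}

/* Encode n integers into buf. Returns the number of bytes written,
 * or 0 when buf is too small to hold them. */
size_t encode_int_array(const int32_t *src, size_t n,
                        unsigned char *buf, size_t buf_size)
{
    assert(src != NULL && buf != NULL);
    if (buf_size / 4 < n)
        return 0;
    for (size_t i = 0; i < n; i++)
        put_be32(buf + 4 * i, src[i]);
    return 4 * n;
}

/* Decode n integers from buf into dst. Returns the bytes consumed. */
size_t decode_int_array(const unsigned char *buf, size_t n, int32_t *dst)
{
    assert(buf != NULL && dst != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = get_be32(buf + 4 * i);
    return 4 * n;
}
```

On a local create neither function would be called, which is why a distributable thread function costs almost nothing extra when it runs on the creating node.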

Message Implementation We have implemented a 
simple socket protocol over TCP/IP to send events back 
and forth between nodes. The protocol consists of three 
messages only; 

• create - is sent to the remote node and contains 
parameters for the family to be created as well as 
the encoded input data. 

• sync (family finished) - is sent back from the re- 
mote node on completion, it includes return values 
and the encoded output data. 

• kill - is sent to a remote node to interrupt the exe- 
cution of a family, it contains information to iden- 
tify this family. 

As we can see from this enumeration, the nature of the 
messages is very simple, and induces minimal over- 
head. In general, the size of the encoded in- or output 
data will be the dominant factor of the message size. 
There is no message for the break action, as this only 
applies to the family in which it is executed, which will 
always be on the same place, and when recursing to 
child families it is the same as a recursing kill. Our 
current implementation does not contain any security; 
however, this could be added easily by introducing 
capabilities [13] on every message and the use of places, 
as well as by using encryption with SSL sockets. 
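
The three messages could be framed on the socket with a small fixed-size header in front of the encoded data. The layout below is purely an assumption made for illustration; the field names, the numeric type values and the 12-byte header are hypothetical and do not describe the wire format of our prototype.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The three protocol messages named in the text. */
enum dsvp_msg_type {
    DSVP_MSG_CREATE = 1,  /* family parameters and encoded input data */
    DSVP_MSG_SYNC   = 2,  /* family finished, with encoded output data */
    DSVP_MSG_KILL   = 3   /* interrupt the identified family */
};

/* Hypothetical header that precedes the encoded data. */
struct dsvp_msg_header {
    uint32_t type;         /* one of enum dsvp_msg_type */
    uint32_t family_id;    /* identifies the family on the remote node */
    uint32_t payload_len;  /* bytes of encoded data that follow */
};

enum { DSVP_HEADER_BYTES = 12 };

static void put_u32(unsigned char *out, uint32_t v)
{
    out[0] = (unsigned char)(v >> 24);
    out[1] = (unsigned char)(v >> 16);
    out[2] = (unsigned char)(v >> 8);
    out[3] = (unsigned char)v;
}

static uint32_t get_u32(const unsigned char *in)
{
    return ((uint32_t)in[0] << 24) | ((uint32_t)in[1] << 16) |
           ((uint32_t)in[2] << 8) | (uint32_t)in[3];
}

/* Write the header in network byte order into out[0..11]. */
void dsvp_header_encode(const struct dsvp_msg_header *h, unsigned char *out)
{
    assert(h != NULL && out != NULL);
    put_u32(out, h->type);
    put_u32(out + 4, h->family_id);
    put_u32(out + 8, h->payload_len);
}

/* Read a header; returns 0 on success, -1 for an unknown message type. */
int dsvp_header_decode(const unsigned char *in, struct dsvp_msg_header *h)
{
    assert(in != NULL && h != NULL);
    h->type = get_u32(in);
    h->family_id = get_u32(in + 4);
    h->payload_len = get_u32(in + 8);
    if (h->type < (uint32_t)DSVP_MSG_CREATE || h->type > (uint32_t)DSVP_MSG_KILL)
        return -1;
    return 0;
}
```

With such a framing the header stays constant in size, so the encoded input or output data dominates the message size, as noted above.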

5. Evaluation and Discussion 

Latency Measurement We have measured the over- 
head imposed by our distributed implementation, by 
measuring the latency over a paired create and sync ac- 
tion on an empty thread function that executes remotely. 
For comparison, we compare it with the latency of nor- 
mal local creates of the same function, as well as re- 
mote creates to a second runtime instance running on 
the same machine over the internal loopback interface. 
These measurements give us insight into the startup cost 
for remote executions, and allow us to make decisions 
about the level of granularity at which it is still feasible 
to delegate to a different node when using this 
implementation. The results of these measurements are 
shown as a histogram in Figure 4, representing 
the distribution of latency over 50000 connections. 
These experiments were all performed on Intel Dual- 
Core machines running Linux kernel 2.6, which were 
connected with a direct non-switched Gigabit Ethernet 
link. We see that creating a thread within another pro- 
cess on the same machine using the local loopback on 
average takes 1 \Apts, and through Gigabit Ethernet it 
takes 345 μs on average, but with 236 μs as a minimum. 
The 50 μs wide gaps between the peaks that are observed 
in the Ethernet transmission are probably caused 
by an optimization in the TCP/IP stack of the host sys- 
tems that delays the delivery of ACK packets. Not 
surprisingly, the overhead for creating threads over the 
network is one order of magnitude slower than locally 
within the same runtime instance, which are on average 
created in around 30 μs. The difference between local 
create and local loopback is that in the latter case the whole 
protocol and serialization over a local TCP/IP socket is 
performed, as well as the scheduling between two distinct 
processes. 
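
A measurement of this kind can be sketched in C as follows. This is not the harness we used; it is a minimal illustration that assumes a POSIX system providing clock_gettime with CLOCK_MONOTONIC, and the names elapsed_us, measure_us and histogram_add are hypothetical. The action passed to measure_us stands in for a paired create and sync on an empty thread function.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Microseconds elapsed between two monotonic timestamps. */
double elapsed_us(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec) * 1e6 +
           (double)(end.tv_nsec - start.tv_nsec) / 1e3;
}

/* Time one invocation of `action`. */
double measure_us(void (*action)(void))
{
    struct timespec start, end;
    assert(action != NULL);
    clock_gettime(CLOCK_MONOTONIC, &start);
    action();
    clock_gettime(CLOCK_MONOTONIC, &end);
    return elapsed_us(start, end);
}

/* Add one sample to a histogram with fixed-width buckets. Samples
 * beyond the last bucket are clamped into it, negative ones into
 * the first. */
void histogram_add(unsigned *buckets, size_t n_buckets,
                   double bucket_width_us, double sample_us)
{
    size_t idx;
    assert(buckets != NULL && n_buckets > 0 && bucket_width_us > 0.0);
    if (sample_us < 0.0)
        sample_us = 0.0;
    if (sample_us >= bucket_width_us * (double)n_buckets)
        idx = n_buckets - 1;
    else
        idx = (size_t)(sample_us / bucket_width_us);
    buckets[idx]++;
}
```

Repeating measure_us many times and feeding each result to histogram_add yields a latency distribution of the kind discussed above; the bucket width determines whether gaps between peaks remain visible.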

Reducing Overhead In alternative implementations 
of the methods proposed in this report, there is potential 
to greatly reduce the overhead compared to our proto- 
type implementation. For example, on a networked dis- 
tributed system with relatively homogeneous nodes, i.e. 
with the same internal data representation (including 
endian-ness), time can be saved on (de)serialization and 
encoding. On future many-core architectures or NUMA 
systems, the communication between targets will likely 
use an efficient low overhead internal messaging imple- 
mentation instead of TCP/IP, and the same argument 
against serialization holds. In fact, if we were to use 
such data description functions, they would probably 
be used to synchronize the data between the local and 
remote node by software coherency or memory dupli- 
cation. For the fully distributed heterogeneous platform 
it could perhaps be beneficial to investigate a protocol 
based on the much lighter UDP instead of the TCP/IP 
socket approach, as well as a run-time that supports a 
finer grain of threads than pthreads. 

Reference Transparency Many Distributed Shared 
Memory (DSM) implementations strive to have a form 
of reference transparency where a reference to an ob- 
ject can be accessed on any node. Usually this is done 
by using a shared address space, and optionally us- 
ing some special fat references which also encapsulate 
which node is the data's home location. As we have argued 
earlier, such fat references can be built on top of 
DSVP as a combination of a place and a normal 
reference, and nothing prohibits a DSVP implementation 
from using a distributed shared address space between 
nodes. However, with the prototype implementation 
that we have described here, and using the restricted 
consistency model of SVP, we achieve a similar 
programming model in a heterogeneous environment. 
The implementation can also support a dynamic number 
of nodes, as the interaction between nodes is 
limited to the points where concurrency is created and 
synchronized. This is not easy to achieve in a system 
that attempts to maintain a single global address space. 
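
A fat reference built from a place and a normal reference could look roughly like the following sketch. The Place structure and the FatRef template are hypothetical names chosen for illustration; the report does not define such types. The address is treated as an opaque value everywhere except at the home place, which is the only location where it may be dereferenced.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical identifier for an SVP place: a node plus a resource on it.
struct Place {
    std::uint32_t node;
    std::uint32_t resource;
};

inline bool operator==(const Place& a, const Place& b) {
    return a.node == b.node && a.resource == b.resource;
}

// A fat reference: the home place of the data plus a normal reference
// (address) that is only meaningful at that place.
template <typename T>
struct FatRef {
    Place home;
    std::uintptr_t address;  // opaque outside the home place

    // True if the referenced object lives at the place we are running on.
    bool is_local(const Place& here) const { return home == here; }

    // Dereferencing is only valid at the home place.
    T* local_ptr(const Place& here) const {
        assert(is_local(here));
        return reinterpret_cast<T*>(address);
    }
};

// Create a fat reference to an object owned by the place `here`.
template <typename T>
FatRef<T> make_fat_ref(const Place& here, T* object) {
    return FatRef<T>{here, reinterpret_cast<std::uintptr_t>(object)};
}
```

Code that holds such a reference at another place would not dereference it, but would instead create a family at the home place that operates on the data there.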

Mapping and Resource Management Even though 
some notions of resources and their organization are 
visible through the concept of places, the extension of 
the SVP model presented here is still resource agnos- 
tic. Using our prototype, we can specify other nodes by 
hand so that they appear as places to the user program 
and can remotely execute functions. We have also implemented 
a resource management system based on the 
SEP protocol [31] which can do this dynamically. The 
details of that implementation are beyond the scope of 
this report, but it supports the dynamic aggregation of 
resources into a single DSVP system where nodes can 
join and leave the network at run time, offering a set of 
software components as services. A program can acquire 
a place on another node and then create there one 
of the software components that the node offers. As 
SVP only provides places as a hook to identify resources 
that a computation is bound to, and does not perform 
any mapping itself, an SEP-like service that acts as a 
place server handing out places could also take into account 
the efficient mapping and placement of software 
components. 
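
To make the idea of a place server concrete, the sketch below models one as a small in-memory registry. It is not the SEP protocol and not the resource manager mentioned above; the PlaceServer class, its method names, and the policy of handing each place to one client at a time are all assumptions made for illustration.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

using Place = int;  // hypothetical place identifier, one per node

// Toy place server: nodes join with the components they offer, and
// clients acquire an idle place that offers the component they need.
class PlaceServer {
public:
    // A node joins the system and advertises its software components.
    void join(Place node, const std::set<std::string>& components) {
        offers_[node] = components;
    }

    // A node leaves the system at run time.
    void leave(Place node) {
        offers_.erase(node);
        busy_.erase(node);
    }

    // Hand out the first idle place that offers `component`.
    // Returns false if no such place is currently available.
    bool acquire(const std::string& component, Place& place) {
        assert(!component.empty());
        for (const auto& entry : offers_) {
            const bool idle = busy_.count(entry.first) == 0;
            const bool offers = entry.second.count(component) != 0;
            if (idle && offers) {
                busy_.insert(entry.first);
                place = entry.first;
                return true;
            }
        }
        return false;
    }

    // Return a place to the pool once the family created on it has synced.
    void release(Place place) { busy_.erase(place); }

private:
    std::map<Place, std::set<std::string>> offers_;
    std::set<Place> busy_;
};
```

A server that takes mapping into account would replace the first-fit loop in acquire with a policy that considers load or locality.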
Figure 4. Response time of TCP/IP implementation 



Fault Tolerance Distributed systems have the disadvantage 
that communication is not always reliable. The 
communication link, or perhaps even a whole node, 
might be unreliable or completely down. Future 
many-core architectures are no exception to these 
kinds of problems; with 1000s of cores on a chip it is 
unavoidable that there will be faulty cores or 
interconnects present. Therefore, 
it is essential that software on such platforms is fault 
tolerant P31[T1. 

In our implementation, we can use retries up to a 
certain level to hide some of the communication prob- 
lems, unless a target is not responding within a reason- 
able amount of time. When waiting and retrying are not 
enough, we want to inform the application, which then 
may give up and display an error message, or could try 
to adapt itself to the new situation. If the application is 
looking for generic resources to execute a certain soft- 
ware component, it could try to get resources to execute 
it on another target instead of the failed one. In terms 
of SVP, this means sending a kill to the old family that 
is not responding, and creating a new family on a new 
place. As the input and output data of a software component 
are defined, it can easily be restarted on another 
target using the same input data again. Software components 
that do maintain state at a place are typically 
services, which are required to be implemented redun- 
dantly in such a system using replication. 

When the implementation cannot create a compo- 
nent on the desired target or gets notified that the target 
failed, it will have the corresponding sync return an in- 
dication that the family did not complete and the state 
of the output data is undefined. Similarly, if the com- 
ponent fails to complete within an application-defined 
time, it can be killed by a watchdog process, which is 
then reflected in the sync return value. In both cases, 
a new place should be selected, and the component is 
(re)started there. This kind of flexibility in a system can 
be very useful, not only to recover from communication 
errors but also to adapt to, for example, dynamically 
changing load or availability of resources. 
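
The recovery scheme described above can be sketched as a simple failover loop. The sketch models the create and sync of a component as a single callback that returns a status; SyncResult, RunAt and run_with_failover are hypothetical names and do not correspond to the prototype's interface. Because the state of the output data is undefined after a failed sync, each attempt writes into a fresh buffer, and the same unmodified input is reused for the restart.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

using Place = int;                // hypothetical place identifier
using Input = std::vector<int>;   // stands in for the described input data
using Output = std::vector<int>;  // stands in for the described output data

// Hypothetical outcome reported by the sync of a family.
enum class SyncResult { Completed, Failed, TimedOut };

// Models "create the component at this place and sync on it".
using RunAt = std::function<SyncResult(Place, const Input&, Output&)>;

// Try the candidate places in order until the component completes.
// On success, `output` holds the result and `used` the place that ran it.
bool run_with_failover(const std::vector<Place>& candidates,
                       const RunAt& run_at,
                       const Input& input,
                       Output& output,
                       Place& used) {
    assert(run_at && "run_at must be callable");
    for (Place place : candidates) {
        Output scratch;  // output is undefined after a failure, so start clean
        if (run_at(place, input, scratch) == SyncResult::Completed) {
            output = std::move(scratch);
            used = place;
            return true;
        }
        // The family failed or was killed by the watchdog:
        // select the next place and restart with the same input.
    }
    return false;  // every candidate failed; report this to the application
}
```

The list of candidate places could come from a place server, which would also allow the application to adapt to changing load rather than only to failures.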

6. Related and Future Work 

Over the years, many ways of programming dis- 
tributed environments have been developed. There 
are distributed shared memory (DSM) implementations, 
which for example use implicit or explicit sharing of objects 
GIl[H|[40][38], regions ESEUEHHl, or an entire 
address space [37]. The other end of the spectrum has 
been dominated by explicit message passing techniques 
ifTSl fT7l, and in between we have remote calls (possibly 
to remote objects) [42][23][46][36][39], which can 
also be based on web service interfaces (£). We will 
now discuss some of these approaches in more detail, 
and compare them with DSVP. 

Ivy [37] was one of the first DSM systems that 
attempted to act as a transparent single address space 
shared memory system by sharing memory at the 
page level and using handlers on page misses to transfer 
data. However, this did not turn out to work efficiently 
enough, false sharing being one of the issues, 
and many later DSM implementations are based on explicitly 
acquiring, reading or modifying, and releasing 
state. CRL [33], for example, uses a region-based approach 
where special global pointers are used to map 
and unmap shared regions of arbitrary size to code running 
on a node. After a region is mapped, the code can 
enter either a reading or a writing section, where writing 
sections guarantee exclusive access. Munin [10] 
also uses the acquire/release principle, but allows the 
consistency protocol, which is based on release consistency 
[19], to be configured for individual objects; i.e. 
invalidate or update copies on write, enabling replication 
and fixed home locations. Cid [38] implements 
acquire/release with a single writer and multiple readers 
as well, but also exposes the location of objects, with the 
ability to start a computation on an object on the node 
where it is located, providing the flexibility of moving 
either the computation or the data. 

In Orca [40] the acquire/release happens transparently 
on shared objects that get replicated. The objects 
are not globally visible but are passed by reference 
between (concurrent) invocations of functions, limiting 
their visibility to a relatively local scope, similar to 
DSVP. However, when multiple functions operate on 
the same object, it is kept coherent by updating or invalidating 
copies on write. Emerald [34] provided similar 
mechanisms; however, it did not support replication and 
therefore did not have to deal with coherency. 

CICO [27] is a cooperative model in which memory 
regions in a shared address space can be checked 
out, checked in and prefetched, which provides a hinting 
mechanism for a hardware-based coherency implementation, 
similar to how we see that the data description 
function annotations could be used on a NUMA-style 
system. The restricted way in which we move data in 
and out of created families has some similarities with, and 
provides the same advantage as, the DAG-consistency 
[4] provided in Cilk [5]; in both there are well defined 
points when data needs to be communicated, as there is 
no strict coherency which requires propagation of updates 
as soon as data is modified. Another approach 
that matches our work even more closely is CellSs [2], 
which uses compiler pragmas to annotate functions with 
their input and output signature to efficiently write programs 
for the distributed memory in the Cell [24] architecture. 
Sequoia ITT31 is a programming model in 
which a (distributed) system is viewed as a hierarchy 
of memories, and, similar to SVP, programs in Sequoia 
can be automatically adapted to the granularity of the 
target system. Sequoia uses call-by-value-result semantics, 
where for each function argument it is specified whether it 
describes an input, an output, or both. GMAC [18] is an 
implementation of an asynchronous distributed shared 
memory which attempts to unify the programmability 
of CPU and GPU memories. The Batch-update mode 
of GMAC matches closely with our approach to consistency; 
however, it also supports more elaborate coherency 
protocols where the GPU can receive updated 
data from the CPU asynchronously. 

In our definition of DSVP we unify the use of 
distributed memory and dynamic concurrency management. 
Unifying the creation of local and remote concurrency 
was investigated widely in the 90s, but was 
considered a bad idea back then [44]. This makes sense, 
as a remote execution on a cluster incurs a latency that is 
many orders of magnitude higher, and partial failure exposes 
different failure patterns. However, we are on the brink of 
the many-core era and things have changed. With many 
cores on one chip, starting an execution from one core 
on another will be orders of magnitude faster than on 
a cluster. And with thousands of cores on a chip, fault 
tolerance needs to be supported to cater for failing cores 
and communication links [1][45]. Checking for failure 
on any concurrent invocation would still be expensive, 
but can be done at the software component level as discussed 
earlier. R-OSGi [39] is a system that takes this 
into account: it distributes transparently at the software 
module level, and does not introduce any new failure 
patterns. Similarly to our prototype implementation, 
it does not impose any role assignments, i.e. whether 
a node acts as a client or server; the relation between 
modules is symmetric. Chapel [11] is a new programming 
language that aims to bridge the gap between parallel 
and sequential programming. Similarly to DSVP, it 
hierarchically expresses both task and data level concurrency, 
which can transparently be executed locally or 
remotely, in parallel or sequentially, but it does not deal 
with partial failure. X10 [12] is similar in that respect 
and is developed with the same goal as Chapel. It bears 
more similarities to SVP with its futures and final variables, 
which resemble our shared and global channels. It 
also uses places to express locations that have sequential 
consistency, which provides a handle for expressing 
locality. Cid [38] has this feature as well in a way, as 
the home node of a piece of data can be extracted. This 
can then be used with its fork if remote construct, executing 
sequentially if the referenced object is local, or 
otherwise remotely in parallel. 

Other approaches such as Active Messages [42], 
CORBA El, Legion [36], RPC 0, Java RMI O and 
SOAP [9], but also message passing approaches such as 
MPI-2 (T3 and PVM (ill, are based on coarse-grained 
parallelism where finer-grained parallelism must be expressed 
in something else; for example in a separate 
threading implementation. MPI-2 and PVM support the 
dynamic creation of tasks, but again, only at the level of 
task parallelism. Most of these approaches support partial 
failure, but at the cost of not making remote communication 
transparent. None of them provide a distributed 
memory abstraction, though CORBA, Java RMI and 
Legion do this in a way by accessing remote objects. A 
lookup service is provided to locate these objects, which 
can be added to DSVP by an SEP [31] implementation. 

Many of the discussed approaches rely on new languages 
or language features, while others will work as 
pure library implementations. DSVP does not exclude 
either of the two approaches; the prototype implementation 
uses a C-based language as input, but this is translated 
to pure C++ with library calls [41] behind the 
scenes. Of course, the argument for a language approach 
would be to be more friendly or efficient for 
the programmer, but in our current toolchains SVP is 
seen as an intermediate low-level representation. There 
are already tools to compile from SAC lETl[20], a high-level 
array programming language, and an SVP-based 
runtime for S-Net [22, 8], a coordination language for 
streaming networks, has been developed. As future 
work, we see that these tools can solve the problem of 
efficiently describing the data dependencies as required 
for DSVP. These are well known in the higher-level 
representations of SAC and S-Net, and could be automatically 
generated when compiling down to DSVP. 

More future work lies in applying DSVP to emerging 
many-core architectures, either in hardware or low-level 
software. We are currently working on an implementation 
on the Intel SCC [28], an experimental 48-core 
processor created by Intel as a 'concept vehicle' 
platform for many-core software research. This platform's 
NUMA-style memory organization and lack of 
cache coherence fit well with the distributed style of 
memory for which DSVP was developed. We hope to 
exploit the efficient on-chip network for communication 
and delegation, as well as its ability to change the memory 
mapping of each core. 



7. Conclusion 

In this report we have discussed how we can apply 
the SVP model of concurrency to platforms with dis- 
tributed memory organizations, which is important in 
order to support decentralized memory organizations in 
future many-core architectures. We came to the con- 
clusion that as long as we can identify software compo- 
nents and their data dependencies in SVP programs, we 
can trivially distribute them across multiple distributed 
memory domains. This approach fits the original mem- 
ory consistency model of SVP and still exposes the 
same restricted-consistency shared memory behavior. 

We have discussed our prototype software imple- 
mentation that we used to explore this domain. It can 
run SVP applications on TCP/IP networks of heteroge- 
neous nodes, and uses data description functions to cap- 
ture the dynamic nature of input and output data. We 
identified the minimal latencies imposed by this implementation 
to give an indication of the level of granularity 
at which it can be used efficiently. However, the main 
contribution is the set of techniques explored here, which 
can be used as a basis for more fine-grained SVP implementations, 
applied to future or current many-core architectures 
with distributed or non-cache-coherent memory. 
As such architectures, like distributed systems, can 
suffer from partial failure, we have shown how the combination 
of the accurate description of input and output 
data and restricted points of communication in SVP can 
aid in the recovery from failure at the software component 
level. 